This is a natural language analysis of matching soccer team names, carried out as part of my research on Betting Strategy and Model Validation. The topic comes from the final course, Data Science Capstone, on Coursera (JHU, Johns Hopkins University), which I have failed a few times and will retake this coming October 2015 (next month).
Note that the echo = FALSE and include = FALSE parameters were added to the code chunks below to prevent printing of the R code that generated the plots and tables. You can, however, view the source code in Natural Language Analysis.Rmd.
Set up knitr options and load the required libraries.
Create a parallel computing cluster and support functions.
Read the dataset of worldwide soccer matches from 2011 to 2015, provided by a British betting consultancy referred to here as firm A.
table 2.1 48744 x 17
Because the dataset is very large (48744 x 17), rendering it in full keeps the web page loading and prevents it from opening. Here I show only a few rows of the data frame.
Read the dataset of worldwide soccer matches from 2011 to 2015, scraped from the spbo livescore website.
table 2.2 488929 x 20
Because the dataset is very large (488929 x 20), rendering it in full keeps the web page loading and prevents it from opening. Here I show only a few rows of the data frame.
To match strings, we can first apply match() or %in% to the team names. R is case-sensitive, so names that differ only in capitalisation are not treated as duplicates. I therefore apply tolower() before matching, since such names refer to exactly the same team in real life.
| team | spbo | pass |
|---|---|---|
| 3 de Febrero | 3 de Febrero | Duplicated |
| Aachen | Aachen | Duplicated |
| Aalesund | Aalesund | Duplicated |
| 12 de Octubre | 12 De Octubre | Capital Letters |
| Argentinos Juniors | Argentinos juniors | Capital Letters |
| EsPa | ESPA | Capital Letters |
table 3.1.1 1190 x 3
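The two matching passes described above can be sketched in base R as follows. This is a minimal illustration, not the hidden chunk from the source: the vectors `team` and `spbo` are small hypothetical samples, and the `pass` labels simply mirror the table.

```r
# Hypothetical sample names; the real vectors come from firm A and spbo.
team <- c("3 de Febrero", "Aachen", "12 de Octubre", "EsPa", "Lincoln City")
spbo <- c("3 de Febrero", "Aachen", "12 De Octubre", "ESPA", "Lincoln")

# Pass 1: exact, case-sensitive membership test.
exact <- team %in% spbo

# Pass 2: case-insensitive lookup; idx is the position in spbo, or NA.
idx <- match(tolower(team), tolower(spbo))

result <- data.frame(
  team = team,
  spbo = spbo[idx],
  pass = ifelse(exact, "Duplicated",
                ifelse(!is.na(idx), "Capital Letters", NA_character_)),
  stringsAsFactors = FALSE
)
print(result)
```

Names that remain NA after both passes (here Lincoln City) are the ones left for approximate matching.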
One concern is that a second team’s name is normally identical to the first team’s name with only II, Reserve, etc. appended, yet a trailing token does not always mark a reserve side: Mainz 05, for example, is a first team, not a fifth reserve team. The more match data we scrape, the more accurate the matching becomes. If we scraped only one day of data, how could we match the first team if, say, only the Chelsea reserve team played on that date?
Another concern is that the first team may be named TSV 1860 Munchen while its second and U19 teams are listed as 1860 Munchen II, 1860 Munchen U19, etc. Similarly, the name Lincoln is supposed to match Lincoln City rather than Lincoln United, yet by string distance Lincoln City is closer to a string such as Lincoln Xxitxx than to Lincoln.
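The Lincoln example can be checked with a plain Levenshtein distance. The sketch below uses base R's `adist()` rather than the stringdist package, purely so that it runs without extra dependencies; the names are the ones quoted in the paragraph above.

```r
# Levenshtein (edit) distances between the names discussed above.
x <- c("Lincoln", "Lincoln City")
y <- c("Lincoln City", "Lincoln United", "Lincoln Xxitxx")

d <- adist(x, y)            # rows follow x, columns follow y
dimnames(d) <- list(x, y)
print(d)
```

Under this metric Lincoln is 5 edits away from Lincoln City, while Lincoln City is only 4 edits away from Lincoln Xxitxx, which is why a pure distance threshold can prefer the wrong candidate.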
Besides, if I give priority to matching the kick-off date and only then the team names, postponed staked matches become a concern: a match may be postponed after firm A has placed its bets (firm A sometimes bets on the early market), or the kick-off date may be changed unexpectedly before kick-off due to snow, downpour, etc.
I load the stringdist package and apply its approximate matching function amatch() to the team names.
Let’s take an example.
[1] “Lincoln City”
table 3.2.1 10 x 12
I simply search for the keyword Lincoln in the home and away team names in the data from firm A.
table 3.2.2 10 x 12
For the two tables above, I apply stringdist with maxDist set to the default value 0.1 as well as 0.5, 1.0, 2.0 and Inf, and select all available methods (the 10 methods listed in section 3 before the code is run). I do not pretend to know how the stringdist() algorithms match strings internally. Therefore I try both the unique team names and all elements (without filtering for uniqueness).
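A grid over methods and maxDist values might look like the sketch below. It assumes the stringdist package is installed (the block is skipped otherwise), uses hypothetical sample names, and covers only a subset of the methods; hamming is left out here because it is only defined for strings of equal length.

```r
# Sketch only: runs when the stringdist package is available.
if (requireNamespace("stringdist", quietly = TRUE)) {
  # Hypothetical sample names, not the full firm A / spbo vectors.
  team <- c("Lincoln City", "Argentinos Juniors", "Mainz 05")
  spbo <- c("Lincoln", "Lincoln United", "Argentinos juniors", "Mainz 05 II")

  methods  <- c("osa", "lv", "dl", "lcs", "jw")
  max_dist <- c(0.1, 0.5, 1, 2, Inf)

  # One row per (method, maxDist) combination.
  grid <- expand.grid(method = methods, maxDist = max_dist,
                      stringsAsFactors = FALSE)

  matches <- lapply(seq_len(nrow(grid)), function(i) {
    # amatch() returns the index of the closest element of spbo,
    # or NA when no candidate lies within maxDist.
    idx <- stringdist::amatch(team, spbo,
                              method  = grid$method[i],
                              maxDist = grid$maxDist[i])
    data.frame(method = grid$method[i], maxDist = grid$maxDist[i],
               team = team, spbo = spbo[idx], stringsAsFactors = FALSE)
  })
  matches <- do.call(rbind, matches)
  print(head(matches, 10))
}
```

Comparing the `spbo` column across rows of `matches` shows how quickly a looser maxDist starts to pair first teams with reserve or unrelated sides.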
I also tried simply applying the agrep() function to partially match the team names.
| Matching1 | team1 | spbo1 | Matching2 | team2 | spbo2 |
|---|---|---|---|---|---|
| Lincoln | Lincoln City | Lincoln | Lincoln City | Lincoln City | NA |
| Lincoln | NA | Lincoln (MO) | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln (Pa.) | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln Red Imps | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln Reserve | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln United | Lincoln City | NA | NA |
| Lincoln | NA | Lincoln Women | Lincoln City | NA | NA |
| Lincoln | NA | Rivadavia Lincoln | Lincoln City | NA | NA |
table 3.3.1 8 x 6
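The behaviour in the table can be reproduced with a few lines of base R. The candidate vector below is a hypothetical subset of the spbo names, and `max.distance = 0.1` is simply `agrep()`'s default, not necessarily the value used in the hidden chunk.

```r
# Hypothetical subset of spbo names.
spbo <- c("Lincoln", "Lincoln (MO)", "Lincoln Red Imps", "Lincoln United",
          "Rivadavia Lincoln", "Liverpool")

# agrep() does approximate *substring* matching, so every name that
# contains something close to the pattern is returned.
hits <- agrep("Lincoln", spbo, max.distance = 0.1, value = TRUE)
print(hits)

# A whole-string, case-insensitive comparison is far stricter.
exact_hit <- spbo[tolower(spbo) == tolower("Lincoln")]
print(exact_hit)
```

Because the pattern only needs to appear somewhere inside the candidate, the short name Lincoln pulls in many unrelated clubs, while the longer name Lincoln City finds nothing on the spbo side.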
Secondly, the article Merging Data Sets Based on Partially Matched Data Elements applies subsetting to partially match the team names.
The table below displays a few of the matched team names that are not accurate.
| teamID | spboID | Match |
|---|---|---|
| AaB Aalborg | AaB Aalborg U17 | Partial |
| Airdrie United | Airdrie United Women | Partial |
| AS Trencin | AS Trencin U19 | Partial |
| Gremio Barueri | Gremio Barueri SP U20 | Partial |
| Sao Caetano | Sao Caetano Women | Partial |
| Sheffield United | Chesterfield United Women | Partial |
table 3.4.2 1306 x 3
From the table above we can see that the team AaB Aalborg from firm A is matched with AaB Aalborg U17 from the livescore website, and Airdrie United with Airdrie United Women, although these are totally different teams. Such mismatches would lead a researcher to calculate wrong predictive figures for investment.
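One possible guard against this kind of false positive is to reject a partial match when only one side carries a squad marker such as U17 or Women. The sketch below is an assumption on my part rather than the method used in the source; the marker list in `squad_tag` is hypothetical and far from exhaustive.

```r
# Hypothetical samples taken from the table above.
team <- c("AaB Aalborg", "Airdrie United", "Sao Caetano")
spbo <- c("AaB Aalborg U17", "Airdrie United Women", "Sao Caetano", "AaB Aalborg")

# Tokens that mark a different squad (assumed list, extend as needed).
squad_tag <- "\\b(U[0-9]{2}|Women|Reserves?|II)\\b"

# All team/spbo combinations.
pairs <- expand.grid(team = team, spbo = spbo, stringsAsFactors = FALSE)

# Partial match: the firm A name occurs literally inside the spbo name.
pairs$partial <- mapply(function(t, s) grepl(t, s, fixed = TRUE),
                        pairs$team, pairs$spbo, USE.NAMES = FALSE)

# Keep a pair only if both names agree on whether a squad marker is present.
pairs$same_squad <- grepl(squad_tag, pairs$team, perl = TRUE) ==
                    grepl(squad_tag, pairs$spbo, perl = TRUE)

keep <- pairs[pairs$partial & pairs$same_squad, c("team", "spbo")]
print(keep)
```

With these samples, Airdrie United is left unmatched instead of being paired with the women's team.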
In order to maximise the number of soccer matches (observations) available for the research, I break the matching of team names into a few steps, using split() and cross-matching to rearrange the data separately before starting the algorithmic matching function in section 4, Reprocess the Data.
I would like to plot a hierarchical chart of the split team names for agrep. However, the rpart and randomForest packages require numeric data, and the diagram package offers nothing special for this, so here I plot two dynamic graphs instead.
Since the simpleNetwork() function only accepts a two-column dataset, I split the data into 2 graphs.
Prior to starting the algorithmic string matching, I use the signature() idea that Merging Data Sets Based on Partially Matched Data Elements applies to country names to reduce some of the minor differences between strings. In this case, convert all characters to lower case, sort the words alphabetically, and then concatenate them with no spaces. For example, United Kingdom becomes kingdomunited. This reduces the string distance and so maximises the matching result.
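The three steps (lower-case, sort words, concatenate) can be sketched as a small helper. The function name `name_signature` is my own, chosen to avoid masking `methods::signature()`; it is not the code from the cited article.

```r
# Build an order- and case-insensitive key for each name.
name_signature <- function(x) {
  words <- strsplit(tolower(trimws(x)), "\\s+")   # lower-case, split on spaces
  vapply(words,
         function(w) paste(sort(w, method = "radix"), collapse = ""),
         character(1))                            # sort words, glue together
}

print(name_signature("United Kingdom"))
print(name_signature(c("12 de Octubre", "12 De Octubre")))
```

Names that differ only in capitalisation or word order collapse to the same key, so they can be joined with an ordinary `match()` before any fuzzy method is needed.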
Here I split the team names into a list and simply apply grep and agrep as a first filter.
There is a good example in How can I match fuzzy match strings from two datasets?, which applies expand.grid() to build a data frame and then follows the Expectation Maximization idea by using a while loop on stringdist().
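A loop of that general shape might look like the sketch below. It is a simple greedy pairing, not a reproduction of the cited answer: base R's `adist()` stands in for `stringdist()` so the block has no dependencies, the names are hypothetical samples, and the cut-off `max_dist` is an arbitrary assumption.

```r
# Hypothetical sample names.
team <- c("Lincoln City", "Argentinos Juniors", "EsPa")
spbo <- c("ESPA", "Argentinos juniors", "Lincoln", "Liverpool")

# Every team/spbo combination; team varies fastest, which matches the
# column-major order of the distance matrix below.
pairs <- expand.grid(team = team, spbo = spbo, stringsAsFactors = FALSE)
pairs$dist <- as.vector(adist(tolower(team), tolower(spbo)))

max_dist <- 5            # assumed cut-off, tune for real data
matched  <- pairs[0, ]   # empty data frame with the same columns

# Repeatedly take the closest remaining pair, then drop both names
# so each team and each spbo entry is used at most once.
while (nrow(pairs) > 0 && min(pairs$dist) <= max_dist) {
  best    <- pairs[which.min(pairs$dist), ]
  matched <- rbind(matched, best)
  pairs   <- pairs[pairs$team != best$team & pairs$spbo != best$spbo, ]
}
print(matched)
```

Taking the closest pairs first means exact and near-exact names are locked in before the looser candidates compete for what is left.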
From the above table, I have matched the team names used in Section 2 Dataset of Betting Strategy and Model Validation. Here I apply method = osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw and soundex inside the stringdist function. Feel free to apply the function to scrape and rearrange the team names and soccer score data for your own odds price modelling.
It’s useful to record some information about how your file was created.
rmarkdown package version: 0.8.1

[1] “2015-11-22 10:38:00 JST”

| setting | value |
|---|---|
| version | R version 3.2.2 (2015-08-14) |
| system | x86_64, mingw32 |
| ui | RTerm |
| language | (EN) |
| collate | English_United States.1252 |
| tz | Asia/Tokyo |
| date | 2015-11-22 |

| sysname | release | version | nodename | machine | login | user | effective_user |
|---|---|---|---|---|---|---|---|
| “Windows” | “7 x64” | “build 9200” | “SCIBROKES” | “x86-64” | “Scibrokes” | “Scibrokes” | “Scibrokes” |